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We consider applying Bayesian Variable Selection Regression, or 
BVSR, to genome-wide association studies and similar large-scale re- 
gression problems. Currently, typical genome-wide association stud- 
ies measure hundreds of thousands, or millions, of genetic variants 
(SNPs), in thousands or tens of thousands of individuals, and attempt 
to identify regions harboring SNPs that affect some phenotype or out- 
come of interest. This goal can naturally be cast as a variable selection 
regression problem, with the SNPs as the covariates in the regression. 
Characteristic features of genome-wide association studies include 
the following; (i) a focus primarily on identifying relevant variables, 
rather than on prediction; and (ii) many relevant covariates may have 
tiny effects, making it effectively impossible to confidently identify the 
complete "correct" subset of variables. Taken together, these factors 
put a premium on having interpretable measures of confidence for 
individual covariates being included in the model, which we argue 
is a strength of BVSR compared with alternatives such as penalized 
regression methods. Here we focus primarily on analysis of quanti- 
tative phenotypes, and on appropriate prior specification for BVSR 
in this setting, emphasizing the idea of considering what the priors 
imply about the total proportion of variance in outcome explained by 
relevant covariates. We also emphasize the potential for BVSR to es- 
timate this proportion of variance explained, and hence shed light on 
the issue of "missing heritability" in genome- wide association studies. 
More generally, we demonstrate that, despite the apparent computa- 
tional challenges, BVSR can provide useful inferences in these large- 
scale problems, and in our simulations produces better power and 
predictive performance compared with standard single-SNP analy- 
ses and the penalized regression method LASSO. Methods described 
here are implemented in a software package, pi-MASS, available from 
the Guan Lab website http://bcm.edu/cnrc/mcmcmc/pimass. 
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1. Introduction. The problem of identifying relevant covariates in a re- 
gression model, sometimes known as variable selection, arises frequently in 
many fields. As computational and data-collection technologies have devel- 
oped, the number of covariates typically measured in these kinds of problems 
has steadily increased, and it is now not unusual to come across data sets 
involving many thousands or millions of covariates. Here we consider one 
particular setting where data sets of this size are common: genome-wide 
association studies (GWAS). 

Current typical GWAS [e.g., Wellcome Trust Case Control Consortium 
(2007)] measure hundreds of thousands, or millions, of genetic variants (typ- 
ically Single Nucleotide Polymorphisms, or SNPs), in hundreds, thousands, 
or tens of thousands of individuals, with the primary goal being to identify 
which regions of the genome harbor SNPs that affect some phenotype or 
outcome of interest. While many GWAS are case-control studies, here we 
focus primarily on the computationally-simpler setting where a continuous 
phenotype has been measured on population-based samples, before briefly 
considering the challenges of extending these methods to binary outcomes. 

Most existing GWAS analyses are "single-SNP" analyses, which simply 
test each SNP, one at a time, for association with the phenotype. Strong 
associations between a SNP and the phenotype are interpreted as indicating 
that SNP, or a nearby correlated SNP, likely affects phenotype. The primary 
rationale for GWAS is the idea that, by examining these SNPs in more 
detail — for example, examining which genes they are in or near — we may 
glean important insights into the biology of the phenotype under study. 

In this paper we examine the potential to apply Bayesian Variable Selec- 
tion Regression (BVSR) to GWAS (or other similar large-scale problems). 
Variable selection regression provides a very natural approach to analyzing 
GWAS: the phenotype is treated as the regression response, SNPs become 
regression covariates, and the goal of identifying genomic regions likely to 
harbor SNPs affecting phenotype is accomplished by examining the genomic 
locations of SNPs deemed likely to have nonzero regression coefficients. 
However, BVSR requires the use of computationally-intensive Markov chain 
Monte Carlo (MCMC) algorithms, and, prior to performing this work, it 
was unclear to us whether such algorithms could produce reliable results in 
a practical time- frame for problems as large as a typical GWAS. One impor- 
tant contribution of this paper is to show that, even using relatively simple 
MCMC algorithms, BVSR can indeed produce useful inferences in prob- 
lems of this size. Another important contribution is to discuss how BVSR 
should be used for GWAS analysis, with particular focus on choice of appro- 
priate prior distribution. Further, and perhaps most importantly, we give 
reasons why one might want to use BVSR to analyze GWAS — rather than 
less computationally-demanding approaches such as single-SNP analyses, 
or penalized regression approaches such as LASSO [Tibshirani (1996)] — by 
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emphasizing qualitative advantages of BVSR in this context. In particular, 
we emphasize that, unlike penalized regression approaches, BVSR naturally 
produces easily-interpretable measures of confidence — specifically, posterior 
probabilities — that individual covariates have nonzero regression coefficients. 
This is a particularly important advantage in GWAS because the primary 
goal of the analysis is to identify such covariates, and to use these identifica- 
tions to learn about underlying biology (in contrast to other settings where 
prediction may be the primary goal). 

Although our work is motivated by GWAS, many of the ideas and results 
should be of more general interest. In brief, the key elements are as follows: 

• We demonstrate that BVSR can be practical for large problems involving 
hundreds of thousands of covariates and thousands of observations. 

• We introduce some new ideas for prior specification in BVSR. In par- 
ticular, we emphasize the benefits of focusing on what the priors imply 
about the total proportion of variance in response explained by relevant 
covariates (henceforth abbreviated as PVE). We note that standard ap- 
proaches to prior specification in BVSR, which put the same priors on the 
regression coefficients irrespective of how many covariates are included in 
the model, imply that models with many relevant covariates are likely 
to have much larger PVE than models with few relevant covariates. We 
propose a simple alternative prior that does not make this potentially 
undesirable assumption, and has the intuitively appealing property that 
it applies stronger shrinkage in more complex models (i.e., models with 
more relevant covariates). 

• We emphasize the potential for BVSR to estimate the total amount of 
signal in a data set, specifically the PVE, even when there is insufficient 
information to reliably identify all relevant covariates. As a result, BVSR 
has the potential to shed light on the so-called "missing heritability" ob- 
served in many GWAS [Maher (2008); Yang et al. (2010)]. 

• We compare and contrast BVSR with a penalized-regression approach, 
the LASSO [Tibshirani (1996)]. Despite the considerable literature on 
both BVSR and penalized regression, there exist few comparisons (either 
qualitative or quantitative) of these two approaches. We chose the LASSO 
as a representative of penalized regression approaches both because of its 
popularity and because previous papers have applied it to the specific 
context of GWAS [e.g., Hoggart et al. (2008); Wu et al. (2009)]. In our 
limited simulation study BVSR outperforms LASSO in terms of predic- 
tive performance. In addition, we emphasize the qualitative advantage 
of BVSR over LASSO, and other penalized regression methods, that it 
produces posterior probabilities for each covariate having a nonzero re- 
gression coefficients. This qualitative advantage seems more fundamental, 
since predictive performance of different methods may vary depending on 
the underlying assumptions. 
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The remainder of the paper is organized as follows. In Section 2 we de- 
scribe BVSR and our choice of priors. In Section 3 we discuss computa- 
tion and inference, including Markov chain Monte Carlo algorithms used, 
and a Rao-Blackwellization approach to estimating the marginal posterior 
inclusion probability for each covariate. Section 4 reviews our main goals 
in applying BVSR to GWAS. In Section 5 we examine, through simula- 
tions, the effectiveness of BVSR for various tasks, including estimating the 
PVE, prediction, and identifying relevant covariates. For some of these tasks 
we compare BVSR with LASSO and single-SNP analyses. We also illus- 
trate BVSR on a GWAS for C-reactive protein. In Section 6 we briefly 
consider the challenges of extending our methods to deal with binary phe- 
notypes. Finally, in Section 7 we discuss some limitations and pitfalls of 
BVSR as we have applied it in this context, and potential future direc- 
tions. 

2. Models and priors. This section introduces notation and specifies the 
details of our BVSR model and priors used. Our formulation up to Sec- 
tion 2.1 is in the same vein as much previous work on BVSR, but with 
particular emphasis on putting priors on hyperparameters that are often 
considered fixed and known. Key relevant references include Mitchell and 
Beauchamp (1988), George and McCulloch (1993), Smith and Kohn (1996), 
Raftery, Madigan and Hoeting (1997) and Brown, Vannucci and Fearn (2002); 
see also Miller (2002) and O'Hara and Sillanpaa (2009) for more background 
and references. 

We consider the standard normal linear regression 

(2.1) y|/x,/3,X,r~Af„(/i + X/3,r-i/„), 

relating a response variable y to covariates X. Here y is an n-vector of 
observations on n individuals, /i is an n-vector with components all equal 
to the same scalar ^, X is an n by p matrix of covariates, /3 is a p-vector of 
regression coefficients, r denotes the inverse variance of the residual errors, 
Nn{-,-) denotes the n-dimensional multivariate normal distribution and In 
the n by n identity matrix. The variables y and X are observed, whereas 
fJ-,P, and T are parameters to be inferred. In more detail, y = (yi, . . . ,yn), 
where yi is the measured response on individual i, and X = (x.i, . . . ,x.p), 
where x.j = {xij, . . . , Xnj)'^ is a column vector containing the observed values 
of the jth covariate. For example, in the context of a GWAS, yi is the 
measured phenotype of interest in individual i, and Xij is the genotype of 
individual i at SNP j, typically coded as 0, 1 or 2 copies of a particular 
reference allele. [By coding the genotypes as 0, 1, or 2, we are assuming an 
additive genetic model. It would be straightforward to include dominant and 
recessive effects by adding another covariate for each SNP, as in Servin and 
Stephens (2007), e.g., although this would increase computational cost.] 
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In many contexts, including GWAS, the number of covariates is very 
large — and, in particular, p ^ n — but only a small subset of the covariates 
are expected to be associated with the response (i.e., have nonzero /3j). 
Indeed, the main goal of GWAS is to identify these relevant covariates. To 
this end, we define a vector of binary indicators 7 = (71, • • • ,7p) G {0, 1}^ 
that indicate which elements of /3 are nonzero. Thus, 



where denotes the design matrix X restricted to those columns j for 
which 7j = 1, and (3^ denotes a corresponding vector of regression coeffi- 
cients. In general, for observational studies one would be reluctant to con- 
clude any causal interpretation for 7, but in the context of GWAS, it is 
usually reasonable to interpret = 1 as indicating that SNP j, or an un- 
measured SNP correlated with SNP j, has a causal (functional) affect on y. 
This is because in GWAS reverse causation is generally implausible (phe- 
notypes cannot causally affect genotype, since genotype comes first tempo- 
rally) , and there are few potential unmeasured confounders that could affect 
both genotype and y [Smith and Ebrahim (2003)]. A well-documented ex- 
ception to this is population structure; here we assume that this has been 
corrected for prior to analysis, for example, by letting y be the residuals 
from regressing the observed phenotype values against measures of popula- 
tion structure, obtained, for example, by model-based clustering [Pritchard 
et al. (2000)] or principal components analysis [Price et al. (2006)]. 

Taking a Bayesian approach to inference, we put priors on the parameters: 



and 5q denotes a point mass on 0. Here vr, da. A, k, and are hyperparame- 
ters. The hyperparameters vr and aa have important roles, with vr refiecting 
the sparsity of the model, and aa refiecting the typical size of the nonzero 
regression coefficients. Rather than setting these hyperparameters to pre- 
specified values, we place priors on them, hence allowing their values to be 
informed by the data; the priors used are detailed below. (Later we will argue 
that this ability to infer vr and aa from the data is an important advantage 
of analyzing all SNPs simultaneously, rather than one at a time.) The re- 
maining hyperparameters are less critical, and, in practice, we consider the 



(2.2) 



y|7,/i,r,/3,X~Af„(/i + X-^/3 ,T /„) 
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posterior distributions for which o"^ — )■ oo and z^, k — >■ 0, which has the at- 
tractive property that the resulting relative marginal likelihoods for 7 are 
invariant to shifting or scaling of y. Thus, for example, inference of which 
genetic variants are associated with height would be unaffected by whether 
height is measured in meters or inches. [Taking these limits is effectively 
equivalent to using the improper prior p{h,t) oc 1/r, but we prefer to for- 
mulate proper priors and take limits in their posteriors, to verify sensible 
limiting behavior.] 

The parameter vr controls the sparsity of the model, and where the ap- 
propriate level of sparsity is uncertain a priori, as is typically the case, it 
seems important to specify a prior for vr rather than fixing it to an arbi- 
trary value. In GWAS, and probably in many other settings with extreme 
sparsity, uncertainty in vr may span orders of magnitude: for example, there 
could be just a few relevant covariates or hundreds. In this case a uniform 
prior on vr seems inappropriate, since this would inevitably place most of the 
prior mass on larger numbers of covariates (e.g., uniform on 10"^ to 10-3 
puts about 90% probability on >10-^). Instead, we put a uniform prior on 
logvr: 

(2.8) logir r^U{a,b), 

where a = log(l/p) and b = log(M/p), so the lower and upper limits on vr 
correspond, respectively, to an expectation of 1 and M covariates in the 
model. In applications here we used M = 400, with this arbitrary limit being 
imposed partly due to computational considerations (larger M can increase 
computing time considerably). The assumption of a uniform distribution is, 
of course, somewhat artificial but has the merit of being easily interpretable. 
An alternative, which may be preferable in some settings, would be a normal 
prior on log(7r/(l — vr)). 

The above formulation, with the exception of our slightly nonstandard 
prior on vr, follows previous work. However, since many formulations of 
BVSR differ slightly from one another, we now comment on some of the 
choices we made: 

(1) We chose, in (2.6), to put independent priors on the elements of (3^. 
An alternative common choice is Zellner's (7-prior [Zellner (1986); Agliari 
and Parisetti (1988)], which assumes correlations among the regression co- 
efficients mimicking the correlations among covariates, 

For GWAS we prefer the independent priors because we view the /3's as 
reflecting causal effects of X on y, and there seems to be no good reason to 
believe that the correlation structure of causal effects will follow that of the 
SNPs. 
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(2) Some authors center each of the vectors y and x.i,...,x.p to have 
mean 0, and set fi = 0. This approach yields the same posterior on 7 as our 
hmiting prior on fi (derivation omitted), and simphfies calculations, and so 
we use it henceforth. 

(3) It is common in variable selection problems to scale the covariates 
x.i,...,x.p to each have unit variance, to avoid problems due to different 
variables being measured on different scales. In GWAS these covariates are 
measured on the same scale, being counts of the reference allele, and so we 
do not scale the covariates in this way in our examples. However, one could 
so scale them, which would correspond to a prior assumption that SNPs 
with less variable genotypes (i.e., those with a lower minor allele frequency) 
have larger effect sizes; see Wakefield (2009) for relevant discussion. 

(4) The priors assume that the Pj are exchangeable, and, in particular, 
that all covariates are, a priori, equally plausible candidates to affect out- 
come y. In the context of a GWAS, this assumption means we are ignoring 
information that might make some SNPs better candidates for affecting out- 
come than others. Our priors also ignore the fact that functional SNPs may 
tend to cluster near one another in the genome. These choices were made 
purely for simplicity; one attractive feature of BVSR is that one could mod- 
ify the priors to incorporate these types of information, but we leave this to 
future work. 

(5) Some formulations of BVSR use a similar "sparse" prior, where the 
marginal prior on Pj is a mixture of a point mass at and a normal dis- 
tribution, whereas others [e.g., George and McCulloch (1993)] instead use 
a mixture of two normal distributions, one with a substantially larger vari- 
ance than the other. The sparse formulation seems computationally advanta- 
geous in large problems because sparsity facilitates certain operations (e.g., 
integrating out /3 given 7). 

2.1. Novel prior on <t^. While the above formulation is essentially stan- 
dard and widely used, there is considerable variability in how different au- 
thors treat the hyperparameter aa- Some fix it to an essentially arbitrary 
value, while others put a prior on this parameter. Several different priors 
have been suggested, and the lack of consensus among different authors may 
reflect the fact that most of them seem not to have been given a compelling 
motivation or interpretation. Here we suggest a way of thinking about this 
prior that we believe aids interpretation, and hence appropriate prior specifi- 
cation. Specifically, we suggest focusing on what the prior implies about the 
proportion of variance in y explained by Xj (the PVE). For example, almost 
all priors we have seen previously in this context assume independence of vr 
and (Ta, which implies independence of 7 and aa- While this assumption may 
seem natural initially, it implies that more complex models are expected to 
have substantially higher PVE. In our application this assumption does not 
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capture our prior beliefs. For example, it seems quite plausible a priori that 
there could be either a large number of relevant covariates with small PVE, 
or a small number of covariates with large PVE. 

Here we suggest specifying a prior on cr^ given 7 by considering the in- 
duced prior on the PVE, and, in particular, by making this induced prior 
relatively flat in the range of (0, 1). To formalize this, let V{P,t) denote the 
empirical variance of XP relative to the residual variance t~^: 

1 " 

(2.9) V{P,T):=-^[{XP)fT, 

i=l 

where this expression for the variance assumes that the covariates have been 
centered, and so X(3 has mean 0. Then the total proportion of variance in y 
explained by X if the true values of the regression coefficients are /3 is given 
by 

(2.10) PVE(/3, r) := V{p, t)/(1 + V{f3, r)). 

Our aim is to choose a prior on (3 given r so that the induced prior on 
PVE(/3, r) is approximately uniform. To do this, we exploit the fact that the 
expected value of V{(3,t) (with expectation being taken over /3|r) depends 
in a simple way on aa- 

(2.11) v{f,aa):=E[V{p,T)h,aa,T]=al Sj, 

where sj = ^ XliLi ^Ij is the variance of covariate j. Define 

(2.12) h{j, aa) = v{j, aa)/{l + v{-f, aa)). 

Intuitively, h gives a rough guide to the expectation of PVE for a given 
value of 7 and aa- (It is not precisely the expectation since it is the ratio of 
expectations, rather than the expectation of the ratio.) To accomplish our 
goal of putting approximately uniform prior on PVE, we specify a uniform 
prior on h, independent of 7, which induces a prior on aa given 7 via the 
relationship 

(2.13) alih,^) ^ ^ 



In all our MCMC computations, we parameterize our model in terms of 
(/i,7), rather than (0-0,7). Note that the induced prior on cr^ is diffuse: if 
Z = h/{l — h), and h ~ U{0, 1), then Z has a probability density function 
f{z) = 1/(1 + z)^, which is heavy tailed. 

Our prior on aa has interesting connections with the prior suggested by 
Liang et al. (2008). While Liang et al. (2008) use a g prior, if we consider 
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the case where the covariates are orthogonal with variances sj = 1, then 
their parameter g is effectively equivalent to our na'^. They suggest putting 
a Beta(l, a/2 — 1) prior on g/ {1+g), with a = 3 or 4; the case a = 4 is uniform 
on g/{l + g), or in our notation uniform on na'^/{l +na'^). In contrast, our 
prior is uniform on |7|cr^/(l + |7|cra). Thus, our aa is effectively n/|7| times 
the value of aa from Liang et al. (2008), and so our aa is larger than theirs 
(implying less shrinkage), provided that the number of relevant covariates I7I 
is less than n. Qualitatively, perhaps the main difference between the priors 
is that our prior applies less shrinkage (larger aa) in simpler models, which 
seems intuitively appealing. 

Of course, choice of appropriate prior distributions may vary according to 
context, and we do not argue that the prior used here is universally superior 
to other choices. However, we do believe the priors outlined above are suit- 
able for general use in most GWAS applications. In addition, we emphasize 
that these priors incorporate two principles that we believe should be help- 
ful more generally: first, it seems preferable to place prior distributions on 
the hyperparameters vr and aa, rather than fixing them to specific values, 
as this provides the potential to learn about them from the data; second, 
when comparing priors for aa, it is helpful to consider what the priors imply 
about PVE. 

3. Computation and inference. We use Markov chain Monte Carlo to 
obtain samples from the posterior distribution of (/i,7r,7) on the product 
space (0, 1) X (0, 1) x {0, 1}^, which is given by 

(3.1) p(/i,7r,7|y) oc p(y|/i,7)p(/i)p(7|7r)p(7r). 

Here we are exploiting the fact that the parameters (3 and r can be inte- 
grated out analytically to compute the marginal likelihood p{y\h,f). For 
each sampled value of h,-f from this posterior, we also obtain samples from 
the posterior distributions of /3 and r by sampling from their conditional 
distributions given y,j,h. 

Our Markov chain Monte Carlo algorithm for sampling /i,vr,7 is detailed 
in Appendix A. In brief, it is a Metropolis-Hastings algorithm [Metropo- 
lis et al. (1953); Hastings (1970)], using a simple local proposal to jointly 
update /i,7r,7. In particular, it explores the space of covariates included in 
the model, 7, by proposing to add, remove, or switch single covariates in 
and out of the model. To improve computational performance, we use three 
strategies. First, in addition to the local proposal moves, we sometimes make 
a longer-range proposal by compounding randomly many local moves. This 
technique, named "small-world proposal," improves the theoretical conver- 
gence rate of the MCMC scheme [Guan and Krone (2007)]. Second, and 
perhaps more importantly, when proposing new values for 7, and specif- 
ically when proposing to add a variable to the model, we focus more at- 
tention on those covariates with the strongest marginal associations. This 
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idea is related to the sure independence screen [Fan and Lv (2008)] (SIS), 
which uses marginal associations as an initial screening step. However, it is 
a "softer" use of these marginal associations than the SIS, because every 
variable continues to have positive probability of being proposed. Simula- 
tions (not shown) show that taking account of the marginal associations in 
this way dramatically increases the acceptance rate compared to a proposal 
that ignores the marginal associations. Finally, when estimating quantities 
of interest, we make use where possible of Rao~Blackwellization techniques 
[Casella and Robert (1996)], detailed below, to reduce Monte Carlo vari- 
ance. 

We note that our computational scheme is relatively simple, and one can 
create data sets where it will perform poorly, for example, multiple corre- 
lated covariates that are far apart along a chromosome, where an efficient 
algorithm would require careful joint updates of the 7^ values for those corre- 
lated covariates. (In our current implementation, swap proposals only apply 
to SNPs that are close to one another in the genome, which is motivated 
by the fact that correlations decay quickly with respect to distance between 
SNPs.) However, our main focus in this paper is not on producing a com- 
putational scheme that will deal with difficult situations that might arise, 
but rather on prior specification, and to provide an initial assessment of 
the potential for BVSR to be applied to large-scale problems. Indeed, we 
hope that our results stimulate more research on the challenging compu- 
tational problems that can occur in applying BVSR to GWAS and similar 
settings. 

3.1. Posterior inclusion probabilities via Rao-Blackwellization. In the 
context of GWAS, a key inferential question is which covariates have a high 
probability of being included in the model. That is, we wish to compute 
the posterior inclusion probability (PIP) of the jth covariate, Pr(7j = l|y). 
Although one could obtain a simple Monte Carlo estimate of this probabil- 
ity by simply counting the proportion of MCMC samples for which 7^ = 1, 
this estimator may have high sampling variance. To improve precision, we 
instead use the Rao-Blackwellized estimate, 

M 

(3.2) Pr(7, = l|y) « (1/M) J]Pr(7, = l\yn^:^^,P^:^^M'\h^'\^^''^), 

1=1 

where -f^'^\ j3^'^\ t^'^\ h^'^\ tt'^^'^ denote the ith MCMC sample from the pos- 
terior distribution of these parameters given y, and 'y_j and f3_j denote 
the vectors 7 and f3 excluding the jth coordinate. The probabilities that 
are being averaged here essentially involve simple univariate regressions of 
residuals against covariate j, and so are fast to compute for all j even when p 
is very large. Details are given in Appendix B. 
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3.2. Estimating proportion of variance explained. To perform inference 
on the total proportion of variance in y explained by measured covariates, we 
use samples from the posterior distribution of PVE(/3, r), which is defined at 
equation (2.10). These posterior samples are obtained by simply computing 
PVE(/3^*'',r'^*'*) for each sampled value of (3,t from our MCMC scheme. 

3.3. Predicting future exchangeable observations. Given observed covari- 
ates Xn+i for a future individual, we can predict a value of Un+i for that 
individual by 

(3.3) E{yn+i\y)=Xn+iE{P\y). 

To estimate E((3\y), we use the Rao-Blackwellized estimates 

M 

(3.4) E(/3,|y) ^ {l/M)Y,mhj = l,y,^3)Pr(7, = My,0^]). 

i=l 

Expressions for the two terms in this sum are given in Appendix B. 

3.4. Assessing predictive performance. Suppose that we estimate (3 to 
be f3. One way to assess the overall quality of this estimate is to ask how 
well it would predict future observations, on average. Motivated by this, we 
define the mean squared prediction error (MSPE): 

m 

(3.5) MSPE(^; (3, r) = E{Xp - yf = Y^Sj0 - + 1/t, 

i=i 

where /3 is the true value of the parameter, and sj is the variance of x.j, 
defined at (2.11). 

The MSPE has the disadvantage that its scale depends on the units of 
measurement of y. Hence, we define a relative prediction gain, RPG, which 
contrasts the MSPE from an estimated /3 with the prediction loss from 
simply predicting the mean of y for each future observation (MSPEq) and 
to the prediction error attained by the true value of (3 (MSPEopt = I/t): 

(Of,. MSPEo-MSPE(^) 

- MSPEo - MSPEopt • 

The RPG does not depend on r or on the scale of measurement of y, and 
indicates what proportion of the extractable signal we are successfully ob- 
taining from the data. For example, if the total proportion of variance in y 
explained by X~f is 0.2, then an RPG of 0.75 indicates that we are effectively 
able to extract three-quarters of this signal, leaving approximately 0.05 of 
the variance in y "unexplained." Note that RPG = if the prediction per- 
forms as well as the mean, and RPG = 1 if the prediction performs as well 
as the true value of /3. If RPG < 0, then the prediction is worse than simply 
using the mean, effectively indicating a problem with "overfitting." 
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4. Goals and expectations. At this point it seems helpful to review what 
we are attempting to achieve, and why it might be achievable despite the 
apparent severity of the computational burden. In brief, our primary goal 
is to extract more information from signals that exist in these data, par- 
ticularly marginal associations, than do standard single-SNP analyses that 
test each SNP, one at a time, for association with phenotype. (The vast 
majority of GWAS published so far restrict themselves to such single-SNP 
analyses.) One of the main difficulties in single-SNP analysis is to decide 
how confident one should be that individual SNPs are truly associated with 
the phenotype. This difficulty stems from the fact that confidence should de- 
pend on the unknown values of vr and aa- In a single-SNP analysis one must 
make assumptions, either implicitly or explicitly, about these parameters. 
An important aim of our approach is to instead estimate these parameters 
from the data, and hence provide more data-driven estimates of confidence 
for each SNP being associated with phenotype. To get intuition into why 
the data are informative about vr and aa, consider the following examples. 
First, suppose that in a GWAS involving 300,000 SNPs, there are 10 SNPs 
(in different genomic regions) that show very strong marginal associations. 
Then, effectively, we immediately learn that vr is likely at least of the or- 
der of 10/300,000 (and, of course, it may be considerably higher). Further, 
the estimated size of the effects at these 10 SNPs also immediately gives 
some idea of plausible values for aa (or, more precisely, for aa/r). Con- 
versely, suppose that in a different GWAS none of the 300,000 SNPs show 
even modest marginal associations. This immediately suggests that either vr 
or aa/r (or both) must be "small" (because if both were large, then there 
would be many strong effects, and we would have seen some of them). More 
generally, we note that the strength of the effect size at the most strongly 
associated SNPs immediately puts an upper bound on what kinds of effect 
size are possible, and hence an upper bound on plausible values for aa/r. In 
essence, BVSR provides a model-based approach to quantifying these quali- 
tative ideas, taking account of relevant factors (e.g., sample size) that affect 
the amount of information in the data. 

Another limitation of single-SNP analyses, at least as conventionally ap- 
plied, is that once some SNPs are confidently identified to be associated with 
outcome, they are not controlled for, as they should be, in analysis of sub- 
sequent SNPs. Controlling for SNPs that truly affect phenotype should help 
in identifying further such SNPs, and so a second key aim of our approach 
is to accomplish this. To see why our approach should attain this goal, note 
that our Rao~Blackwellization procedure for estimating marginal posterior 
inclusion probabilities is effectively a conventional single-SNP analysis that 
controls for the SNPs currently in the model. Thus, for example, if we start 
our MCMC algorithm from a point that includes the strongest marginal 
associations, it effectively immediately accomplishes this second goal. 
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We note two things we are not attempting to do. First, we are not at- 
tempting to identify a single best model (i.e., combination of SNPs), or to 
estimate posterior probabilities for specific models. In this context — and, we 
would argue, many other contexts where BVSR may be appropriate — these 
goals are of no interest, because the combination of small effect sizes and 
n mean that the posterior probability on any particular model is going 
to be very small, and the chance of identifying the "correct" model is effec- 
tively zero. Neither are we attempting to identify combinations of SNPs that 
interact in a nonadditive way to affect the phenotype — SNPs that have little 
marginal signal, but whose effect is only revealed when they are considered 
together in combination with others. While such combinations of SNPs may 
exist, and identifying them would be of considerable interest, this seems 
considerably more challenging, both statistically and computationally, than 
our more modest goals here. 

Finally, we note a particular feature of GWAS studies that may make it 
easier to obtain useful results from BVSR than in other contexts. Specifically, 
correlations among SNPs tend to be highly "local": each SNP is typically 
correlated with only a relatively small number of other SNPs that are near 
to it (linearly along the DNA sequence) , and any two randomly chosen SNPs 
are typically uncorrelated with one another. Put another way, the matrix 
X'X tends to have a highly banded structure, with large values clustering 
near the diagonal. To understand why this is helpful, note that one of the 
main potential pitfalls in applying MCMC to BVSR is that the MCMC 
scheme may get stuck in a "local mode" where a particular covariate {A, say) 
is included in the model, whereas in fact a different correlated covariate 
{B, say) should have been included. To help avoid getting stuck like this, 
the MCMC scheme could include specific steps that propose to interchange 
correlated covariates (e.g., remove A from the model and add B to the 
model), and the local correlation structure among SNPs in a GWAS means 
that this is easily implemented by simply proposing to interchange nearby 
SNPs. Furthermore, and perhaps more importantly, the local correlation 
structure means that getting stuck in such local modes may not matter 
very much, because if A and B are correlated, then they are also almost 
certainly close to one another in the genome, and hence implicate a similar 
set of genes, and correctly identifying a set of implicated genes is the ultimate 
goal of most GWAS analyses. 

5. Simulations and comparisons with other methods. We now present 
a variety of simulation results to illustrate features of our method, and as- 
sess its performance. Because our priors and methodology were primarily 
motivated by GWAS, these simulations are designed to mimic certain fea- 
tures of a typical GWAS. These include particularly that p ^ n (in our 
simulations p ^ 10,000-300,000 and n ^ 1,000), extreme sparsity (in most of 
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our simulations ~30 covariates affect response), and small effect sizes (most 
relevant covariates individually explain <1% of the variance of y). 

5.1. Simulation details. We performed simulations based on three dif- 
ferent genotype data, including both simulated and real genotypes. The 
first is simulated 10,000 independent SNPs (henceforth lOK), the second 
is real genotypes at ~317,000 SNPs (henceforth 317K), and the third is 
real genotypes at ~550,000 SNPs (henceforth 550K). Both 317K and 550K 
genotypes closely mimic real GWAS, and comparison between them can il- 
lustrate the scalability of our method. The lOK data set is helpful for several 
reasons: smaller simulations run faster; they allow us to assess methods in 
a simpler setting where computational problems are less of an issue; and 
the independence of the covariates avoids problems with deciding what is 
meant by a "true association" when covariates are correlated with one an- 
other. 

For the lOK data, we simulated genotypes as follows. At each SNP j = 
1,..., 10, 000 the minor allele frequency fj is drawn from a uniform distri- 
bution on [0.05,0.5], and then genotypes Xij (i = 1, . . . ,n) are drawn inde- 
pendently from a Binomial(2, /j) distribution. We use n = 1,000 and 6,000. 

Both 317K and 550K data sets come from an association study performed 
by the Pharmacogenomics and Risk of Cardiovascular Disease (PARC) con- 
sortium [Reiner et al. (2008); Barber et al. (2010)]. The 317K genotypes 
come from the Illumina 317K BeadChip SNP arrays for 980 individuals and 
the 550K genotypes come from the Illumina 610K SNP chip plus a custom 
13,680 SNP Illumina i-Select chip in 988 individuals (550K SNPs remain 
after QC). We replaced missing genotypes with their posterior mean given 
the observed genotypes, which we computed under a Hidden Markov Model 
[Scheet and Stephens (2006)] implemented in the software package BIMBAM 
[Guan and Stephens (2008)]. 

For both the simulated and real genotypes we simulated sets of phenotypes 
in the following way. First, we specified a value of PVE, the total proportion 
of variance in y explained by the relevant SNPs, that we wanted to achieve 
in the simulated data. Then we randomly selected a set of 30 "causal" SNPs, 
C, and simulated effect sizes /3j for each of these SNPs independently from 
an effect size distribution £{■) (discussed below). Next we computed the 
value of r that gives the desired value for PVE(/3,r) in equation (2.10). 
Finally, we simulated phenotypes for each individual using yi = ^j^c l^j^ij + 
iV(0,T-i). 

Unless otherwise stated, for the lOK SNP data sets we run BVSR for 1 
million iterations, and for the 317K and 550K SNP data sets we use 2 million 
iterations. Run times for each data set varied from a few minutes to about 
one day on a single MAC Pro with 3 GHz processor. (Note that the running 
time per iteration depends primarily on the inferred values for I7I, not the 
total number of SNPs.) 
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5.2. Other methods. In results presented below we compare our method 
with two other methods: simple single-SNP analysis that tests each SNP 
one at a time for association with phenotype, and the penalized regression 
method LASSO [Tibshirani (1996)]. 

For the single SNP analyses we ranked SNPs by their single-SNP Bayes 
factors, computed using equation (A. 2), with 7 in the numerator being the 
vector with jth component 1 and all other components 0, and averaging 
over CTq = 0.4,0.2, and 0.1 as in Servin and Stephens (2007). (Using standard 
single-SNP p values instead of Bayes Factors gives very similar performance 
in terms of ranking SNPs.) 

The LASSO procedure [Tibshirani (1996)] estimates (3 by minimizing the 
penalized residual sum of squares: 



For sufficiently large penalties. A, LASSO produces sparse estimates (3. Its 
main practical advantage over BVSR appears to be computational: for ex- 
ample, one can efficiently find the global optimal solution path for /3 as A 
varies. To apply the LASSO procedure, we used the lars package (v. 0.9-7) 
in R [Efron et al. (2004)]. 

5.3. Inference of PVE, and its relationship to heritability. The total pro- 
portion of variance in y explained by the relevant covariates Xj, or PVE, 
is commonly used to summarize the results of a linear regression. In GWAS 
the PVE is, conceptually, closely related to the "heritability" of the trait, 
which is widely used, for better or worse, as a summary of how "genetic" the 
phenotype is. The key difference between the PVE and heritability is that 
the PVE reflects the optimal predictive accuracy that could be achieved for 
a linear combination of the measured genetic variants, whereas heritability 
reflects the accuracy that could be achieved by all genetic variants. In recent 
GWAS, it has been generally observed, across a range of different diseases 
and clinical traits, that the proportion of phenotypic variance explained by 
"significant" genetic variants is much lower than previous estimates of heri- 
tability from family-based studies [Maher (2008)]. There are several possible 
explanations for this "missing heritability": for example, it may be that pre- 
vious estimates of heritability are inflated for some reason. However, two 
explanations have received particular attention: some of the missing heri- 
tability could reflect genetic variants that were measured but simply did 
not reach stringent levels of "significance" in standard analyses, while other 
parts of the missing heritability could reflect genetic variants that were un- 
measured (and not strongly correlated with measured variants). Because 
the measured genetic variants in current GWAS studies are predominantly 
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"common" genetic variants (those with a population frequency exceeding 
a few percent), the relative contribution of these two factors is connected to 
the contentious topic of the relative contributions of common vs rare vari- 
ants to phenotypic variation and disease risk [Pritchard (2001)]. Comparing 
the PVE with heritability should provide some insights into the relative 
contributions of these two factors. For example, at the simplest level, if the 
PVE is almost as big as the heritability, then this suggests that most phe- 
notypic variation is due to variation at SNPs that are highly correlated with 
measured genetic variants, and perhaps that rare genetic variants, which are 
usually not strongly correlated with measured common variants, contribute 
little to phenotypic variation. 

An important feature of BVSR that allows it to estimate the PVE, to- 
gether with measures of confidence, is its use of Bayesian model averaging 
(BMA) to average over uncertainty in which covariates are relevant. This 
is very different from single SNP analyses and standard penalized regres- 
sion approaches, which typically result in identification of a single set of 
potentially-relevant covariates, and so do not naturally provide estimates of 
the PVE that take account of the fact that this set may be missing some 
relevant covariates and include some irrelevant covariates. Since, as far as 
we are aware, the ability of BVSR to estimate PVE has not been examined 
previously, we performed simulation studies to assess its potential. 

For both real and simulated genotype data (described above), we simu- 
lated 50 independent sets of phenotype data, each containing 30 randomly- 
chosen "causal" SNPs affecting phenotype, varying PVE from 0.01 to 0.5 
in steps of 0.01. Our Bayesian model assumes, through the prior on /3, that 
the effect size distribution £ is normal. To check for robustness to deviations 
from this assumption, we simulated phenotype data using both £ = iV(0, 1) 
(as effectively assumed by our model) and iS = DE(l), where DE denotes 
the double exponential distribution. The results from these two different 
distributions were qualitatively similar, and so we show only the results for 
£: = DE(1). 

Figure 1 shows estimates of PVE obtained by our method against the true 
values. For both simulated and real SNP data there is a generally good cor- 
respondence between the true and inferred values, and 90% credible intervals 
(CI) for PVE covered the true value in 85% of cases. As might be expected, 
the uncertainty in PVE is greater when there is a larger number of SNPs, 
presumably due to the increased difficulty in reliably identifying relevant 
variants. In addition, the uncertainty in PVE tends to be greater when the 
true PVE is smaller. Our intuition is that when the data contain no SNPs 
with strong individual effects, it remains difficult to rule out the possibility 
that many SNPs may have very small effects that combine to produce an 
appreciable PVE. Nonetheless, even when the true PVE is small, the in- 
ferred posterior interval for PVE does exclude large values, illustrating that 
even in this case our method is able to extract information from the data. 
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Fig. 1. Comparison of true and inferred values for the proportion of variance in y ex- 
plained by relevant covariates (PVE). Panel (a) shows results for 1,000 individuals with 
10,000 independent simulated SNPs; Panel (b) shows results for 980 individuals with 317K 
real SNP genotypes. Circles indicate posterior mean for PVE; vertical bars indicate the 
symmetric 90% credible interval. 



5.4. Many causal SNPs with tiny effects. The simulations above involve 
30 causal SNPs explaining in total between 0.01 and 0.5 of the total variance 
in y. We note that this is a relatively subtle level of signal: in the following 
sections we will see that, for PVE = 0.30, and the sample sizes we used, it is 
typically not possible to confidently identify the majority of causal SNPs, nor 
to achieve the predictive performance that is similar to one would obtain 
if one knew the causal variants. Thus, to estimate the PVE, BVSR must 
not only identify variants that are confidently associated with y, but also 
estimate how many additional variants of small effects it might be missing 
and what their effect sizes might be. Clearly, there must be some limit to 
its ability to accomplish all these tasks: in particular, if there were very 
many variants of minuscule effects, then it would be difficult to distinguish 
this from the null model in which no variants have any effect. To try to 
test these limits, we ran more challenging simulations involving many more 
SNPs with tiny individual effects, but a nontrivial overall PVE. Specifically, 
we considered two cases: (i) 300 causal SNPs out of the lOK simulated 
SNPs in 1,000 individuals; and (ii) 1,000 causal SNPs out of the 317K real 
SNPs in 980 individuals. In each case we simulate the effect sizes using 
a normal distribution. We simulated 10 independent sets of phenotypes with 
PVE = 0.3 in each case. For comparison in each case we also simulated 10 
independent sets of phenotypes under a "null" model with no causal SNPs 
(PVE = 0). 

For these data sets, to give BVSR some chance to identify the large num- 
ber of causal SNPs, we increased M, the upper limit on the expected number 
of nonzero regression coefficients in our prior on vr, to M = 1,000. Plots of 
99% and 95% credible intervals for PVE in each simulation are shown in 
Figure 2. 
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Fig. 2. Plots showing estimation of PVE for simulations with large numbers of causal 
variants of very small effect. Panels (a) and (b) are for lOK and 317K data sets, respec- 
tively. The grey lines denote 99% CI and colored lines denote 95% CI. The blue color 
indicates null simulation (PVE = 0), red indicates alternative simulations (PVE = 0.3). 
The * denotes the median. 



Somewhat surprisingly, for the first set of simulations, with 300 causal 
SNPs out of lOK and PVE = 0.3, BVSR remains able to provide reasonable 
estimates of PVE: for example, for 5 of the 10 simulations the interquartile 
range of the posterior on PVE spans the true value of PVE = 0.3, and in 

7 simulations the 90% symmetric CI includes PVE = 0.3. Further, there is 
a clear qualitative difference between the results of PVE = 0.3 and PVE = 0. 
Less surprisingly, for the extremely challenging case of 1,000 causal SNPs 
out of 317K, the estimates of PVE are considerably less precise. However, 
even here, these admittedly limited simulations appear to show systematic 
differences between PVE = 0.3 and PVE = 0. For example, for PVE = 0.3, 

8 of the 90% CIs cover PVE = 0.2 and 6 CIs cover PVE = 0.3; whereas for 
PVE = 0, only 2 of the 90% CIs cover 0.2 and 1 CI covers 0.3. 

5.5. Identifying the causal SNPs. In existing GWAS the vast majority 
of studies published so far restrict their analysis to the simplest possible 
approach of testing each SNP, one at a time, for association with phenotype. 
One possible advantage of a multi-SNP analysis like ours is to improve power 
compared with this simple single-SNP approach. However, since each SNP is 
typically correlated with only a small number of other (nearby) SNPs, and 
so any two randomly chosen SNPs will be typically uncorrelated, the gain in 
power might be expected to be small (at least in the absence of interactions 
among SNPs). Further, one might be concerned that if our MCMC scheme 
does not mix adequately, then the results of the multi-SNP approach could 
actually be worse than those from a simpler analysis. 

We performed two types of simulations to investigate these issues, the 
first using the lOK data set (independent SNPs), and the second using the 
chromosome 22 of the 550K data set (9,041 correlated SNPs). In each case 
we simulated 100 phenotype data sets as described above, with 30 causal 
SNPs and PVE = 0.25. 
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For the lOK simulations we compared BVSR, single-SNP analyses, and 
LASSO in their ability to identify the causal SNPs as follows. For BVSR 
and single-SNP analyses we first computed, for each SNP, a measure of 
the evidence for association with phenotype. For BVSR we used the PIPs 
[equation (3.2)]; for single-SNP analysis we used the univariate Bayes Factor 
as described in Section 5.2. We then consider thresholding this measure of 
evidence: for any given threshold, we consider all causal SNPs exceeding the 
threshold to be true positives, and all other SNPs exceeding the cutoff to 
be false positives. We compare methods by constructing curves showing the 
trade-off between true positives and false positives as the threshold is varied. 
For LASSO, we first computed the solution path as A varies. Then, for each 
solution on this path we defined all causal SNPs with nonzero regression 
coefficients to be true positives, and all other SNPs with nonzero regression 
coefficients to be false positives. We then constructed curves showing the 
trade-off between true positives and false positives as A is varied. 

For the real (correlated) SNPs we performed a similar comparison, but 
assessed the methods in their ability to identify the correct genomic regions 
rather than individual SNPs. This is because the three methods differ quali- 
tatively in the way they identify SNP associations when SNPs are correlated 
with one another: single-SNP analyses tend to identify significant associa- 
tions at any SNP that is strongly correlated with a causal SNP; LASSO 
tends instead to select just one or a few correlated SNPs; and BVSR tends 
to spread the association signal (the PIPs) out among correlated SNPs. 
While it may be important to be aware of these qualitative differences when 
interpreting results from the methods, they are not our main interest here, 
and we assess the methods at the level of regions in an attempt to reduce 
the infiuence of these qualitative differences. (Further, it could be argued 
that identifying regions of interest is the primary goal of GWAS.) To de- 
scribe the approach in more detail, we partitioned chromosome 22 into 200 
kb nonoverlapping regions (different choices of region size that we tried pro- 
duced qualitatively similar results). We then used each method to assign 
each region a "region statistic" indicating the strength of the evidence for 
an association in that region. For single SNP analysis we used the maximum 
single SNP Bayes factor within each region; for BVSR we used the sum of 
the PIP for SNPs in the region; and for LASSO we used the penalty A at 
which any SNP in that region is included in the model. Similar to the SNP- 
level comparisons, we plot how true and false positive regions vary as the 
threshold on the region statistic is varied. (We averaged results over two 
different starting positions for the first window, 0, and 100 kb.) 

Figure 3 shows curves of the trade-off between true and false positives for 
each method in the two different simulations. Each point on the curve shows 
the total true vs false positives across the hundred simulated data sets, using 
a common threshold across data sets. (An alternative way to combine data 
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Fig. 3. Graphs showing the trade-off between true positive and false positive SNP iden- 
tifications for the different methods: BVSR [blue), LASSO {red), and single SNP analyses 
(black). Both plots show results that are summed across 100 data sets {see text for further 
explanation) . Panel (a) is for independent simulated SNPs; Panel (b) is based on real 
genotype data for chromosome 22. 

sets is to use a different threshold in each data set, vary the thresholds in 
such a way as to produce the same number of positive findings in each data 
set; the two different ways to combine data sets give similar results.) 

For a given number of false positives, the multi-SNP approaches (BVSR 
and LASSO) always yield as many or more true positives than the single- 
SNP analysis. For the lOK simulated SNPs BVSR and LASSO perform 
similarly, whereas for the real genotypes BVSR is better. (The reasons for 
this difference are unclear to us.) The results demonstrate that, even in the 
case where single-SNP tests might be expected to perform extremely well — 
that is, independent SNPs with no interactions — it is still possible to gain 
slightly in power by performing multi-SNP analyses. Our intuitive explana- 
tion for the gain in power of the multi-SNP approaches is that, once one 
identifies a causal variant, controlling for it will improve power to detect 
subsequent causal variants. Because the SNPs are independent, this gain is 
expected to be small: indeed, if the SNPs were exactly orthogonal, then one 
would expect no gain by controlling for identified variants. However, our re- 
sults show that even in the case of independent SNPs the gain is measurable 
because the finite sample size produces nonzero sample correlations between 
"independent" SNPs. 

We note that, at least in these simulations, most of the gain from the 
multi-SNP methods occurs when the number of false positives is small but 
nontrivial: that is, the multi-SNP methods promote some of the moderately- 
difficult-to-detect causal SNPs slightly higher in the SNP rankings, but not 
so far as to put them at the very top. This suggests that multi-SNP analysis 
may be most useful when used in combination with other types of data or 
analysis that attempt to distinguish true and false positives among the SNPs 
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near the top of the association rankings [as in Raychaudhuri et al. (2009), 
e.g., where information on gene similarities taken from PubMed abstracts 
are used in this way]. 

5.6. Prediction performance. We used the same simulated data as in the 
previous section to compare predictive performance of BVSR and LASSO. 
To measure predictive accuracy, we use the relative prediction gain, defined 
at (3.6). For our method we compute RPG(/9) where P is the posterior 
mean for (3. For LASSO we compute the RPG in two ways, which we will 
refer to as RPGi and RPG2. For RPGi we first compute RPG(/3(*)) for 

each in the solution path for f3 output by the lars package, and take 
the minimum of these relative prediction errors. Note that by taking the 
minimum over A in this way we are effectively assuming that an oracle has 
given us the optimal value for A; in practice, one would need to obtain A 
through other means, such as cross-validation, which would result in worse 
accuracy than RPGi . For RPG2 we take a two-stage approach to prediction. 
First, we use LASSO to select the SNPs that should have nonzero coefficients 
(using the A used for RPGi), and then we estimate the regression coefficients 
of these SNPs using ordinary least squares (/3oLs)' compute RPG2 := 
RPG(/3oLs)- The motivation for this procedure is that if LASSO is able 
to reliably identify the correct coefficients, then the refitting procedure will 
improve predictive performance by avoiding the known tendency for LASSO 
to overshrink nonzero regression coefficients; however, as we shall see below, 
the refitting can be counterproductive when the correct coefficients are not 
reliably identified. 

Figure 4 compares the RPG obtained from the three methods on 100 sim- 
ulated data sets. The RPG from our Bayesian approach is higher than that 
obtained directly from the optimal LASSO solution (RPGi) in 82 of the 100 
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Fig. 4. Comparison of the relative prediction gam (RPG) for BVSR (x-axis) and LASSO 
(y-axis). Black circles are results from the optimal LASSO solution without refitting 
(RPGi); Red crosses are corresponding results with refitting (RPG2), described in the 
main text. 
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data sets, and mean RPG is higher (0.315 vs 0.261). The refitting procedure 
has a substantial effect on predictive accuracy, and, in particular, it sub- 
stantially increases the variance of the performance: for some data sets the 
refitting procedure improves predictive performance, but for the majority 
of data sets it results in much worse RPG. Indeed, RPG2 is often nega- 
tive, indicating that predictive performance after refitting is substantially 
worse than simply using the mean phenotype value, which is the symptom 
of "overfitting." This behavior makes intuitive sense: in cases when the op- 
timal LASSO solution does a good job of precisely identifying many of the 
relevant covariates, and no irrelevant ones, the refitting step improves pre- 
dictive performance, but when the first stage includes several false positives 
the refitting procedure is counter-productive. 

Although our Bayesian model is sparse, our estimated /3 is not sparse 
due to the averaging in (3.4). In some contexts one might want to obtain 
a sparse predictor, so, to examine how this might impact predictive accuracy, 
we computed the RPG for each data set using only the P covariates with 
highest posterior inclusion probabilities (setting other coordinates of /3 to 0) , 
where P = 10, 30, 100. The average RPG for these sparse estimates of /3 
were essentially unchanged from using the nonsparse estimate (3 (RPG = 
0.313,0.315, and 0.315, resp.). 

We also examined the benefits of using Bayesian model averaging (BMA) 
to perform prediction, by computing the RPG obtained using only those 
covariates with a posterior inclusion probability >t where t = 0.2, 0.5, 0.8. 
[When t = 0.5 this is the "median probability model" of Barbieri and Berger 
(2004).] Specifically, we computed the RPG for fi^ = /(Pr(7j = 1) > t)E{j3j\ 
7j = 1), where the two quantities on the right-hand side are estimated 
from (3.2) and (3.4). These estimates have some shrinkage because E{l3j = 
1) is a shrinkage estimate of (3j (due to the normal prior on /3), but they 
do not have the additional shrinkage term Pr(7j = 1) that BMA provides to 
further shrink variables that are not confidently included in the model. The 
average RPG's for these non-BMA estimates were notably worse than for 
the BMA-based estimates: 0.244,0.291, and 0.272, respectively, compared 
with 0.315 for BMA. 

Taken together, these results suggest that BMA is responsible for a mod- 
erate amount of the gain in predictive performance of BVSR compared with 
RPGi, with some of the remainder being due to LASSO's tendency to over- 
shrink estimates of the nonzero regression coefficients. One way to think 
of this is that LASSO has only a single parameter. A, that controls both 
shrinkage and sparsity. In this setting the true solution is very sparse, so A 
needs to be big enough to keep the solution sufficiently sparse; but hav- 
ing A this big also creates an overly strong shrinkage effect. In contrast, 
BVSR effectively avoids this problem by having two parameters, (Ta control- 
ling shrinkage, and vr controlling sparsity. As we have seen, in this context 
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the strategy of refitting the (3 coefficients at the LASSO solution fails to 
improve average predictive performance. Other possible ways around this 
problem include using a more flexible penalized regression model (e.g., the 
Elastic Net [Zou and Hastie (2005)] has two parameters, rather than one), 
or using a procedure that does not overshrink large effect sizes, for exam- 
ple, SCAD [Fan and Li (2001)]. Comparisons of these methods with BVSR 
would be an interesting area for future work. 

5.7. Calibration of the posterior inclusion probabilities. One of the main 
advantages of BVSR compared with Bayesian single-SNP analysis methods 
is that BVSR allows the hyperparameters vr and aa to be estimated from 
the data, and thus provides data-driven estimates of the posterior inclusion 
probabilities (PIPs). One hope is that estimating these parameters from the 
data will lead to better-calibrated estimates of the PIPs than the single-SNP 
approach which effectively requires one to supply educated guesses for these 
parameters. To assess this, Figure 5(a) shows the calibration of the PIPs 
from BVSR, for the simulations used in the estimation of PVE above (fifty 
data sets with PVE = 0.01-0.5 for both normal and exponential effect size 
distributions). The figure shows that the PIPs are reasonably well calibrated. 
In particular, SNPs with high PIP have a high probability of being causal 
variants in the simulations. 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

Posterior Inclusion Probability Posterior Inclusion Probability Posterior Inclusion Probability 



(a) (b) (c) 

Fig. 5. Calibration of the posterior inclusion probabilities (PIPs) from BVSR. The graph 
was obtained by binning the PIPs obtained from BVSR m 20 bins of width 0.05. Each point 
on the graph represents a single bin, with the x coordinate being the mean of the PIPs 
withm that bin, and the y coordinate being the proportion of SNPs in that bin that were 
true positives {i.e., causal SNPs in our simulations) . Vertical bars show ±2 standard errors 
of the proportions, computed from a binomial distribution. Panel (a) is the result of BVSR, 
using the priors described here. The fact that the points lie near the line y = x indicates 
that the PIPs are reasonably well calibrated, and thus provide a reliable assessment of the 
confidence that each SNP should be included in the regression. Panel (b) is the result from 
BVSR fixing n to be either 5x smaller {black star) or 5x larger {blue cross) than the true 
value {oa fixed to true value). Panel (c) is the result of fixing Oa to be either 5x smaller 
{black star) or 5x larger {blue cross) than the true value (vr fixed to true value). 
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To illustrate the potential benefits of using moderately-diffuse prior dis- 
tributions on vr and Ua, allowing their values to be informed by the data, 
rather than fixing them to specific values, we also applied BVSR with ei- 
ther TT or Ga fixed to an "incorrect" value (approximately 5 times larger or 
smaller than the values used in the simulations). Figure 5(b) and (c) show 
how, as might be expected, this can result in poorly-calibrated estimates of 
the PIP (of course, if one were lucky enough to fix both vr and ctq to their 
"correct" values, then calibration of PIPs will be good, but, in practice, the 
correct values are not known). We note that fixing a a to be five- fold too 
large seems to have only a limited detrimental effect on calibration, which is 
consistent with the fact that in single-SNP analyses, with moderate sample 
sizes, BFs are relatively insensitive to choice of Ga provided it is not too 
small [e.g., Stephens and Balding (2009), Figure 1]. This suggests that, in 
specifying priors on cja, it may be prudent to err on the side of using a dis- 
tribution with too long a tail rather than too short a tail. Note that, as 
in Bayesian single-SNP analyses, although the numerical value of the PIP 
is sensitive to choice of vr, the ranking of SNPs is relatively insensitive to 
choice of vr (and, indeed, Ua)- Consequently, in contrast to the calibration 
plot, power plots of the kind shown in Figure 3 are not sensitive to choice 
of prior on either vr or aa (results not shown). 

5.8. Real data analysis: PARC GWAS for C-reactive protein. We ap- 
plied BVSR to analyze a GWAS study to identify genetic variants associ- 
ated with plasma C-reactive protein (CRP) concentration. CRP is a protein 
found in the blood that is associated with inflammation, and is predictive 
of future cardiovascular disease [Ridker et al. (2002)]. The data come from 
the Pharmocogenetics and Risk of Cardiovascular Disease (PARC) study 
[Reiner et al. (2008) and references therein]. 

The available genotype data consisted of 1968 individuals genotyped on ei- 
ther the Illumina 317K chip (980 individuals) or the Illumina 610K SNP chip 
plus a custom 13,680 SNP Illumina i-Select chip (988 individuals). These 
genotype data had undergone basic quality control filters (e.g., removing 
SNPs with very high proportions of missing data, or showing strong depar- 
tures from Hardy- Weinberg equilibrium) prior to our analysis. To merge the 
two data sets, we used genotype imputation [Servin and Stephens (2007); 
Marchini et al. (2007)], using the software package BIMBAM [Guan and 
Stephens (2008)] to replace missing or unmeasured genotypes with their 
posterior mean given the observed genotype data [see Guan and Stephens 
(2008) for discussion of this strategy]. After imputing missing genotypes, 
we removed SNPs with (estimated) minor allele frequency <0.01, leaving 
a total of 530,691 SNPs. 

The phenotype data consisted of plasma concentrations of CRP, mea- 
sured multiple times for each individual, both before and after exposure to 
statin drugs. These multiple measures were adjusted for covariates (age, sex. 
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smoking status, and body mass index), quantile normalized to a standard 
normal distribution, and averaged to produce a single summary measure of 
CRP concentration for each individual (relative to other individuals in the 
same study), as described in Reiner et al. (2008). 

After removing individuals with missing phenotypes, we had phenotype 
and genotype data on a total of 1,682 individuals. We performed four inde- 
pendent MCMC runs, two with 2 million iterations, and two using 4 million 
iterations. These longer runs took approximately 60 and 90 CPU hours on 
a Mac Pro 3 GHz desktop. Comparing results among runs, we found three of 
the runs gave very good agreement in all aspects we examined, whereas the 
fourth run showed mild but noticeably greater departure from the others, 
suggesting possible convergence or mixing issues. For example. Figure 6(a) 
compares the estimated PIPs for each pair of runs, and Figure 6(b) compares 
the estimated posterior distribution of PVE among runs. The remainder of 
the results in this section are based on pooling the results from all four runs. 

The usual way to summarize single-SNP analyses is to report the SNPs 
with the strongest marginal evidence for association. Thus, it might seem 
natural in a multi-SNP analysis to focus on the SNPs with the largest pos- 
terior inclusion probabilities (PIPs). However, this can be misleading. For 
example, if there are many SNPs in a region that are highly correlated with 
one another, and all approximately equally associated with the phenotype, 
then it may be that the correct conclusion is that at least one of these SNPs 
should be included in the model, but there might be considerable uncertainty 
about which one. In this case, even though the posterior probability of at 
least one SNP being included in the model would be high (near 1), none of 
the individual PIPs may be very big, and concentrating on the PIPs alone 
would risk missing this signal in the data. To avoid this problem, we prefer 
to initially summarize results at the level of regions, as we now illustrate. 

We divided the genome into overlapping regions, each 1 Megabase (10^ 
bases) in length, with the overlap between adjacent regions being 0.5 Megaba- 
ses. For each region we computed two quantities: (i) an estimate, E, of the 
posterior expected number of SNPs included in the model, being the sum 
of the estimated PIPs for all SNPs in the region; (ii) an estimate of the 
probabilities, P, that the region contains (a) 1 SNP, (b) 2 SNPs, or (c) more 
than 2 SNPs included in the model. The latter quantities (ii) are perhaps 
the most natural summary of the evidence that the region harbors genetic 
variants affecting phenotype, but (i) has the advantage that it can be easily 
approximated using Rao-Blackwellization, resulting in lower Monte Carlo 
error. Thus, in practice, we suggest examining both quantities, and placing 
more trust in (i) where the two disagree. 

The results are summarized in Figure 7, which also shows results for 
a single permutation of the phenotypes for comparison. The plot clearly 
identifies two regions with very strong evidence for an association with CRP 
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(a) 

Fig. 6. Illustration of the consistency of results across four different runs of the MCMC 
algorithm for the CRP data. In panel (a) the {i,j)th plot compares results for runs i 
and j . Plots in the upper triangle {j > i) compare estimated posterior inclusion probabilities 
(PIPs) for each SNP. Plots m the lower triangle compare estimated posterior expected 
number of SNPs in 1 Mb regions {so each point corresponds to a single region). The line 
y = X is marked in blue. Panel (b) shows posterior distributions of PVE from the four 
MCMC runs. 

in both plots (e.g., E > 0.95), and a third region with moderately strong 
evidence (e.g., E > 0.75). Multiple other regions show modest signals {E = 
0.1 to 0.5), that might generally be considered worthy of follow-up in larger 
samples, although at this level of signal the majority are, of course, unlikely 
to be truly associated with CRP. 

The three regions with the strongest association signals contain the genes 
CRP, HNFIA, and APOE/APOC, all of which have shown robustly-repHca- 
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ted SNP associations with C-reactive protein levels in several other GWAS 
using single-SNP analyses [e.g., Reiner et al. (2008); Ridker et al. (2008)]. 
In addition, in these data, these three regions all contain single SNPs show- 
ing strong associations: the largest single-SNP Bayes factors in each of these 
regions are 10^'^, 10^'^, and 10^'^, respectively. Thus, in this case the identifi- 
cation of regions of interest from BVSR is largely concordant with what one 
would have obtained from a single-SNP analysis. However, we highlight two 
advantages of the BVSR analysis. First, the estimated posterior probabilities 
obtained for each region are easier to interpret than the single-SNP Bayes 
Factors. For example, the estimated posterior probability that the HNFlA 
region contains at least one SNP included in the model is 0.96, and this seems 
much more helpful than knowing that the largest single-SNP Bayes Factor 
in the region is 10^'^. Similarly, for the next most associated region, which is 
on chromosome 10 near the gene FAM13C1, the posterior probability of 0.42 
is simpler to interpret than the fact that the largest single-SNP Bayes Factor 
is 10^'^. And while these single-SNP BFs are easily converted to posterior 
probabilities of association by specifying a prior probability of association 
(effectively vr in our model), the multi-SNP analysis reduces the risk of spec- 
ifying an inappropriate value for vr by learning about vr from the data. 

A second advantage of BVSR is its ability to estimate the PVE. To il- 
lustrate this, we first consider a typical single-SNP analysis in this context, 
which estimates the PVE for "significant" SNPs by performing ordinary 
least-squares regression on those SNPs. Applying this approach to these 
data, using a relatively liberal (by GWAS standards) threshold for signifi- 
cance (single-SNP BF >10^), we find that significant SNPs explain approx- 
imately 6% of the overall variance in CRP after controlling for covariates. 
Comparing this with some previous estimates of heritability of CRP in the 
range 0.35-0.4 [Pankow et al. (2001); Lange et al. (2006)] suggests that 
a substantial amount of genetic variation influencing CRP remains to be 
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Fig. 7. For eac/i 1 Mb region we show an estimate from BVSR that the region contains 
1 [black), 2 [red), or more than 2 (green with ★) SNPs m the regression. The 1 Mb regions 
overlap by 0.5 Mb, and so any SNP with a large PIP would cause a signal to occur in 2 
adjacent regions on the plot. Panel (a) shows sum of PIP in each 1 Mb region {truncated 
at 1). Panel (b) shows estimated probabilities that each genomic region harbors variants 
associated with CRP levels. In panel (c) we permute phenotype once and produce the same 
plot as a comparison. 



identified, a feature tliat has become known as "missing heritability." One 
question of interest is to what extent this shortfah might be explained by 
measured genetic variants that simply failed to pass the significance thresh- 
old, vs being explained by unmeasured genetic variants or other factors. To 
assess this, we examine PVE obtained from applying BVSR on measured 
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SNPs. The posterior distribution for PVE [Figure 6(b)] has mean 0.14, with 
a symmetric 90% CI of [0.05, 0.25]. Note that, as one might expect, the lower 
part of this CI is similar to the estimated PVE of "significant" SNPs. Be- 
cause most of the posterior distribution lies above 0.06, we infer that larger 
studies of the same set of SNPs might be expected to uncover considerably 
more signal than this study. (Consistent with this, a larger study involv- 
ing 6,345 women typed at a subset of the SNPs considered here identified 
four additional genome regions containing SNPs associated with CRP levels 
[Ridker et al. (2008)].) Conversely, the fact that the upper part of the CI 
(0.25) remains well short of previous estimates of heritability suggests that 
not all of the missing heritability is likely to be explained by simply con- 
ducting larger studies of the same SNPs, and that some alternative factors 
(e.g., unmeasured rare variants) may also contribute. 

6. Extension to binary phenotypes. Although we have focused here on 
quantitative traits, BVSR is also potentially applicable to binary pheno- 
types, and this is important for GWAS applications because they often in- 
volve binary phenotypes. In this section we briefly summarize our attempts 
to extend BVSR in this way. 

A standard approach to applying BVSR to binary phenotypes is to use 
a probit link function. In practice, this is usually accomplished by introduc- 
ing latent variables z which are assumed to follow the standard linear regres- 
sion (2.2), and to be related to the observed outcomes y by i/j = 1 {zi > 0) 
[Albert and Chib (1993)]. Posterior inference is performed by integrating 
out z using Markov chain Monte Carlo, which requires implementation of 
only one additional update compared with the quantitative trait (an update 
of the z variables). 

A nice feature of this probit-based approach using latent Gaussian vari- 
ables is that it would allow us to use the same priors as for quantitative out- 
comes, except that these priors now relate to the unobserved latent (Gaus- 
sian) variables, rather than the observed (binary) outcomes. Furthermore, 
we can continue to summarize the overall signal by estimating the PVE of the 
latent variables. However, the way we have set things up, with an improper 
prior on r, this would lead to improper posteriors on r and z [because the 
likelihood p(y|z) is unchanged by multiplying z by any positive constant]. 
This could be rectified in a number of ways. For example, we could fix r 
[e.g., to 1, as in Albert and Chib (1993)]. Here we instead choose to impose 
an identifiability constraint directly on the elements of z, by constraining 
them to have (empirical) variance 1, because this allows us to re- use ex- 
actly the same computer code as for the quantitative phenotypes (whereas 
fixing r would necessitate some changes). In addition, in an attempt to im- 
prove mixing, we make the approximation that the marginal distribution of 
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the elements of z will be normal, which should be a reasonable approxima- 
tion under the linear regression model (2.2) provided that there are no very 
large values for /3. Specifically, we restrict zi, . . . , z„|y to take a fixed set of 
values, being the n equally-spaced quantiles of a standard normal A^(0, 1) 
distribution, with the values corresponding to the uq individuals with yi = 
being constrained to be the first uq of these quantiles. The intuitive moti- 
vation for this constraint is that it can reduce the potential to fall into poor 
local optima by ruling out implausible configurations of z that correspond to 
some SNPs having very large effects. (Of course, this may not be a good idea 
in settings where very large effects are more plausible.) With this constraint 
in place, local Metropolis-Hastings proposals for z simply involve randomly 
picking a pair of individuals with the same (binary) phenotype value 
and proposing to swap the values of Zi and Zj. (For long-range proposals, 
we simply compound this local proposal randomly many times.) 

To provide a brief illustration of the potential for this approach, we ap- 
plied the method to some simple simulated data sets. The genotypes were 
simulated in the same way described in Section 5.1, using 10,000 indepen- 
dent SNPs genotyped in n = 1,000 and 6,000 individuals. We simulated 
latent normal phenotypes by randomly selecting 30 causal SNPs and simu- 
lating a quantitative phenotype z with prespecified PVE as in Section 5.1. 
We then converted these n quantitative phenotypes to n binary phenotypes 
by mapping the largest n/2 z values to y = 1 and the remainder to y = 0. 
Figure 8 illustrates how reliably we are able to infer the PVE of the latent 
variables from the binary data. More generally, we find that provided we 
limit analyses to thousands of SNPs, we are able to obtain generally reliable 
results for binary traits (e.g., results from multiple independent runs largely 
agree with one another). Thus, for example, we should be able to obtain 
reliable results for small genomic regions, such as individual genes, which 
can itself be of considerable interest [Servin and Stephens (2007)]. However, 
our experience with larger real data sets involving hundreds of thousands of 
SNPs indicates that mixing is, as one might expect, harder for binary traits 
than for quantitative traits, and that to obtain reliable results in practice 
for GWAS may require longer MCMC runs and/or further methodological 
innovation. 

7. Discussion. In this paper we have demonstrated that BVSR can be 
successfully applied to large problems, with a particular focus on genome- 
wide association studies. We have argued that BVSR has several potential 
benefits compared with standard single-SNP analyses, among them the abil- 
ity to obtain data-driven estimates of hyperparameters that must otherwise 
be specified more subjectively by the user, and the ability to estimate the 
overall signal (PVE) that might be accounted for by relevant covariates, even 
when confidently identifying the relevant covariates is not possible. We have 
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Fig. 8. Comparison of true and inferred values of PVE for binary phenotypes. The es- 
timated PVE is on the y-axis and the true PVE on the x-axis. Panels (a) and (b) are 
for n = 1,000 and n = 6,000 individuals, respectively. Circles indicate posterior mean for 
PVE; vertical bars indicate the symmetric 90% credible interval. 



also introduced a novel, more interpretable, approach to prior specification 
in BVSR, and shown that BVSR can provide a competitive alternative to 
the penalized regression procedure LASSO. 

However, despite our generally upbeat assessment, there are a number of 
potential limitations of the methods we have described here, which present 
both pitfalls to be aware of in practice, as well as challenges and opportuni- 
ties for future work. 

One important aspect of analysis of any GWAS is the potential for data 
quality to adversely impact results. For example, although modern geno- 
typing technologies provided very high quality genotypes on average, some 
SNPs are much harder to genotype accurately than others, and genotyping 
error can occur at some SNPs at an appreciable rate. This can cause false 
positive associations if genotyping error is correlated with phenotype (which 
it can be, particularly in case-control studies if the DNA quality differs ap- 
preciably between cases and controls [Clayton et al. (2005)]). While quality 
control is vital to any study, it is of potentially even greater import in multi- 
SNP analyses than in single-SNP analyses, because in multi-SNP analyses 
the association results at one SNP affect the results at other SNPs, and so 
low quality data at a few SNPs may impact estimated associations at other 
SNPs. Thus, it seems particularly important to attempt to impose strin- 
gent data quality filters before embarking on a computationally-intensive 
multi-SNP analysis. 

One limitation of the methods we present here is the assumption that 
effect sizes are normally distributed. Although our simulations with expo- 
nential effect sizes suggest a certain amount of robustness to this assumption, 
it is important to note that there are some phenotypes where the normality 
assumption is clearly wrong. For example, in type 1 diabetes, one region of 
the genome, the MHC, contains genetic variants whose effect on phenotype 
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may be substantially greater than any other region. When such regions of 
unusually large effect are known, it would be prudent to run methods like 
ours both including and excluding data at these loci, to check for robust- 
ness of conclusions. More generally, the robustness of our BVSR could be 
improved by replacing the assumption of normally-distributed effects with 
a heavy-tailed distribution such as a t with small or moderate degree of 
freedom, or indeed with a prior on the degrees of freedom. 

Another related issue is that we assume the residual distribution of the 
phenotypes to be normal. To improve robustness to this assumption, we 
typically normal quantile transform the observed phenotypes to have a nor- 
mal distribution before analysis (which while not strictly ensuring that the 
residuals are normal, does in our experience limit problems that might oth- 
erwise be caused by deviations from normality, such as occasional outlying 
values). Again, the use of a t distribution for the residuals might be prefer- 
able. 

We view the work presented here as just the very start of what could be 
done with BVSR in GWAS. One important extension would be to incorpo- 
rate additional information into the prior distribution on which variables are 
included in the regression (7 in our notation). Here we have assumed that 
variables are included in the model, independently, with common probabil- 
ity vr. This independence assumption ignores likely local spatial dependence 
of 7. In particular, it would be unsurprising to see multiple functional vari- 
ants occurring in a single gene, and, indeed, analyses of genetic data in the 
CRP gene have suggested that it contains multiple SNPs affecting CRP lev- 
els [Verzilh et al. (2008); Stephens and Balding (2009)]. The independence 
assumption in the prior we use here makes it overly skeptical about this 
possibility. Another important possibility is that one could allow the prior 
probability of each SNP being included in the regression to depend on an- 
notations of the SNP, such as where it lies relative to a gene, or whether it 
lies in a genomic region that is conserved across several species (a sign that 
the region may be functional). Of course it is not generally known a priori 
how much such annotations should affect the prior inclusion probabilities. 
However, with BVSR one could estimate hyperparameters that affect the 
prior inclusion probabilities from the data [Veyrieras et al. (2008)]. 

Finally, despite our focus on GWAS, many of the issues we have discussed 
here have broad relevance. In particular, while the computational challenges 
of BVSR remain considerably greater than penalized regression methods, 
we believe that the qualitative advantages of BVSR make it worth investing 
effort into designing more efficient inference algorithms for BVSR, to be 
able to better deal with the very large-scale applications that are becoming 
increasingly common. 
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APPENDIX A: DETAILS OF MCMC SCHEME 

We use Markov chain Monte Carlo to obtain samples from the posterior 
distribution of {h, vr, 7) on the product space of (0, 1) x (0, 1) x {0, 1}*', which 
is given by 

(A.l) p(/i,7r,7|y) oc p(y|/i,7)p(/i)p(7|7r)p(7r). 

Here we are exploiting the fact that the parameters (3 and r can be integrated 
out analytically to compute the marginal likelihood p{y\h,-f). Indeed, in the 
limit for the hyperparameters A, k — )■ and o"^ — > 00 that we use here, we 
have 

P(y|/^,7) 1/2,0,1/2 1 / yV-y^^^y V"/^ 
^'^■'> p(y|/i,7 = 0) '''' <Ta{h,jp\\ y*y-ny2 J ' 

where J7 := (cja(^, 7)~^/|7j + Xt^Xj)^^ and denotes the p-vector of all Os. 
[For derivation, see Servin and Stephens (2007), Protocol SI equation (13).] 
Note that here aa{h,f) is given by equation (2.13). 

For each sampled values of h,y from this posterior, we obtain samples 
from the posterior distributions of /3 and r by sampling from their condi- 
tional distributions given y,7,/i: 

T|y , /i, 7 ~ r(n/2, 2/(yV - y'X^nx'^y)), 

(A.3) f3^\T,y,h,j^N{nxt,y,{l/T)n), 

/3-7k,y,/i,7~^o- 

Our Markov chain Monte Carlo algorithm for sampling h, vr, 7 is based on 
a Metropolis-Hastings algorithm [Metropolis et al. (1953); Hastings (1970)], 
using a simple local proposal to jointly update /i,7r,7. In outline, the local 
proposal proceeds as follows. First a new proposed value of 7, 7', is obtained 
by small modification of the current value (see below for more details); then 
a new value of vr is proposed from a Beta(|7'|,p— [7'[ + 1) distribution; finally 
a proposed new value for h is obtained by adding a C/(— 0.1,0.1) random 
variable to the current value (reflecting proposed values that lie outside 
[0,1) about the boundary). The proposal distribution for tt is proportional 
to its full conditional distribution given 7' inside the finite range of the prior 
on vr [given by (2.8)]; on the infrequent occasions that the proposed value 
for vr lies outside this range, it is of course rejected. 

In addition to the local proposal described above, we sometimes (with 
probability 0.3 each iteration) make a longer-range proposal by compound- 
ing randomly-many local moves [the number being uniform on (2, . . . ,20)]. 
This technique, named "small-world proposal," improves the theoretical con- 
vergence rate of the MCMC scheme [Guan and Krone (2007)]. 
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We now give details on our update proposal for 7. When adding a covari- 
ate into the model we use a rank based proposal that focuses more attention 
on covariates that are more likely to be included in the model. To do this, we 
first rank the covariates based on their association with phenotype y (specif- 
ically we rank them by the Bayes factor for the model including only that 
covariate vs the null model containing no covariates, evaluated at ctq = 1). 
Let Qt be a distribution on (0, ... ,t — 1) which has decreasing probability. 
Here we choose Qt to be a mixture Qt = 0.3Ut + 0-7Gt, where Ut is a uniform 
distribution on {0, . . . ,t — 1} and Gt is a geometric distribution truncated 
to {0, . . . ,t — 1}, with its parameter chosen to give a mean of 2,000. 

Now let 7+ denote the set of covariates that are currently in the model, 
7+ = {i-'yi = 1}. Let 7" denote the complimentary set. We define three 
different types of moves, namely, add a covariate, remove a covariate, and 
exchange a pair of covariates in and out of the current model. Each move 
starts by setting 7' = 7. Then we randomly choose among the following: 

• Add covariate: Generate r ~ Qp-k, and find the covariate i G 7" that has 
rank r (among covariates in 7"). Set 7^ = 1. 

• Remove covariate uniformly: Uniformly pick i G 7"*", and set 7^ = 0. 

• Add a covariate and remove another: Pick i uniformly from 7"*" and j 
uniformly from 7", and set 7^ = 0;7^- = 1. 

In our current implementation, at each update we randomly select among 
these moves with probabilities 0.45, 0.45, and 0.1. 

APPENDIX B: CALCULATIONS FOR RAO-BLACKWELLIZED 

ESTIMATES 

In this appendix we derive the calculations need to compute the terms in 
equation (3.2). 

Let 9-j denote the parameters (7_j, /3_j, r, /i, vr). Note that 
(B.l) Pr(7, = l|y,^„,)- — 



1 + A' 



where 

A 

(B.2) 



p{ij = o\y,0-j) 

p{y\jj = 1, e^j) p{P_j hj = 1, h, n) p{^j = l|7_j, r, h, tt) 
P{y\lj = 0, O^j) p(/3_j-|7j = 0, 7-j, T, h, tt) p{jj = 0\j-j,T, h, n) 

^ p{y\jj = l,e^j)p{(3__j\-fj = l,7-j,T,/t) TT 

p{yhj = o> 0~j) p(.P-jhj = Oi 7-i> /i) 1 - vr ■ 

The second term here arises because in our parameterization /3_j is not 
independent of 7^- (because its prior variance, o"a, is a function of h,j). 
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This term is easily computed from the fact that (3_j\^,T,h are i.i.d. ~ 
N{0,aHh,-f)/T). 

To compute the numerator of the first term note that 

(B.3) y|7i = 1, e_, ~ N{X^.jl3^_, + /i + X,-/?^, l/r/), 

with the priors on [from (2.3)] being 

li\T^N{0,al/T), 

(B.4) 

/3,\T^NiO,al/r). 

Integrating out /i, /3j gives 

(B.5) p(y|7. = l,r) = (27r)~"/V"/2i^^expf--(i?*i?-i?*Xf)X*i2)TY 

where R = y — X~^-.j/3-y^j, ■y — j denotes the vector obtained by taking 7 

and setting the jth coordinate to 0, 17 = {X^X + z> = (^'^^ ^2^, 

and X = {l,Xj) is an n x 2 design matrix whose first column is all Is. [See 
equation (8) from Protocol SI in Servin and Stephens (2007).] The posterior 
distribution on /3j is given by 

(B.6) /3j|y, e-jr^N{nx^R,n). 

Similarly, to compute the denominator of the first term, we use 
(B.7) y|7, = 0, e., ~ N{X^^j(3^., + fi, (1/r)/), 

with priors on //|r ~ A^(0,(j^/r). Integrate out fi to get 

1/2 

(B.8) p(y|7, =0,t) = (27r)-"/V"/^^exp(^-^(i2*i2-Qon2i?2)^^, 

where JIq = (^"^^ + and R = ^Y^Ri. 
From this we obtain 

(B.9) = = B^^^expf ^-{R^XnX^R - n^n^R^' 

^ ' p(y|7,=0,^_,) 171/2 ^^2^ 

In the limit 00 we have ^ n and ^ ^ {^^ ^2^ and the above ex- 

pression becomes 

(B.io) p{y\i^ = ^.o~,) ^ \^\i/2^,^J L^Rtxnx^R - uR 

^ ' p(y|7i = 0,^-j) ' ' aa 

Note that this calculation effectively involves a univariate regression of 
the residuals R against covariate j. Furthermore, all covariates j ^ 7^ use 
the same residuals: only for j G 7+ do the residuals need to be recomputed. 
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Acronyms used in the paper. 

• BMA: Bayesian model averaging 

• BVSR: Bayesian variable selection regression 

• GWAS: genome wide association studies 

• LASSO: least absolute shrinkage and selection operator, a popular variable 
selection method 

• MCMC: Markov chain Monte Carlo 

• PIP: posterior inclusion probability 

• PVE: proportion of variance explained 

• RPG: relative prediction gain 

• SNP: single nucleotide polymorphism 

• SIS: sure independence screen, a two-stage variable selection procedure. 
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